Optimizing Document Similarity Detection in Persian Information Retrieval

Authors

  • Omid Kashefi
  • Nina Mohseni
  • Behrouz Minaei-Bidgoli
Abstract

Most data on the Web takes the form of text or images, and finding the desired data in a timely and cost-effective way is a problem of wide interest. In recent years, many search engines have been created to help Web users find the information they want. In this paper, we present a new technique that eliminates affixes and their effect on recognizing similar Persian documents. After reviewing the rules and exceptions for affixes in the Persian language, we extracted about 300 common inflectional suffixes and their combinations. We evaluate how eliminating affixes from Persian texts affects document similarity under four major document similarity approaches: Latent Semantic Indexing, Shingling, the Vector Space Model, and Co-occurrence. The evaluation results show improved retrieval and detection of similar documents after the affixes are eliminated.
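The idea in the abstract can be illustrated with a minimal sketch: strip inflectional suffixes from each token, then compare documents with word shingles and the Jaccard coefficient. This is not the authors' implementation. The four suffixes below are a tiny illustrative sample rather than the roughly 300 suffixes the paper extracted, and the names `SUFFIXES`, `strip_suffix`, `shingles`, `jaccard`, and `similarity`, the `min_stem` guard, and the choice of Jaccard over word bigrams are all assumptions made for this example.

```python
# Illustrative sample only; the paper's list of about 300 suffixes is not reproduced here.
# Suffixes are written as Unicode escapes so the code points are unambiguous.
SUFFIXES = [
    "\u0647\u0627\u06cc",        # -ha-ye (plural + ezafe)
    "\u0647\u0627",              # -ha (plural)
    "\u062a\u0631\u06cc\u0646",  # -tarin (superlative)
    "\u062a\u0631",              # -tar (comparative)
]

ZWNJ = "\u200c"  # zero-width non-joiner, often written before a suffix


def strip_suffix(token, suffixes=SUFFIXES, min_stem=2):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    token = token.replace(ZWNJ, "")
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def shingles(tokens, k=2):
    """Return the set of contiguous k-word shingles in a token list."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity(text_a, text_b, k=2, stem=True):
    """Shingle-based similarity of two whitespace-tokenized texts."""
    tokens_a, tokens_b = text_a.split(), text_b.split()
    if stem:
        tokens_a = [strip_suffix(t) for t in tokens_a]
        tokens_b = [strip_suffix(t) for t in tokens_b]
    return jaccard(shingles(tokens_a, k), shingles(tokens_b, k))
```

Calling `similarity(a, b, stem=False)` and `similarity(a, b, stem=True)` on the same pair of texts gives a rough sense of how much suffix variation alone lowers a shingle-based score. A real system would also need the exception handling the abstract mentions, since blind suffix stripping can damage words that merely end in the same letters.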

Similar resources

Persian Plagiarism Detection Using Sentence Correlations

This report describes the Persian plagiarism detection system we used to submit our run to the Persian PlagDet competition at FIRE 2016. The system was built in four main stages. The first is pre-processing and tokenization. The second is constructing a corpus of sentences from the combination of the source and suspicious document pair. Each sentence is considered to be a document and represented as a...

A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

The task of plagiarism detection entails two main steps: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second subtask. We report our monolingual plagiarism detection system, which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passag...

Creating a Persian-English Comparable Corpus

Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However, the Persian language lacks rich multilingual resources because of some of its special features and the difficulty of constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in Englis...

Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents

This paper presents a new approach to Persian plagiarism detection. The approach uses a graph structure together with one of the graph similarity methods (iterative methods) to detect the similarity of two Persian documents. Documents are represented by a graph with a specified length, and each part of the suspicious document is then compared to that of the source document. The graph is ...

Investigating the Role of Types of Homograph Context in Determining Similarity Between Documents

Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation and thereby improve the effectiveness of the retrieved results. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided into different types, wit...

Journal title:
  • JCIT

Volume 5

Publication date: 2010